ICU-23350 Support name aliases of type correction in UnicodeSet \N by eggrobin · Pull Request #3918 · unicode-org/icu

eggrobin · 2026-03-30T15:29:21Z

Currently, ICU4C rejects the UnicodeSet [\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}], and accepts only the misspelling [\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET}]. With this change, both expressions are valid and equivalent to [︘].

Name aliases of other types still do not work (ICU-8963), and UAX44-LM2 loose matching still does not work as specified (ICU-3736).
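
The intended behaviour can be sketched without ICU. The following is a minimal, self-contained illustration and not the ICU implementation: `NameEntry`, `kNames`, and `lookupByNameOrCorrection` are hypothetical names, and the two-row table is only a stand-in for the real Unicode name data. U+FE18 is the character whose formal name contains the misspelling BRAKCET and whose alias of type correction spells BRACKET.

```cpp
#include <array>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the Unicode name data: one row per character,
// holding its formal name and, if it has one, its alias of type "correction".
struct NameEntry {
    char32_t codePoint;
    std::string_view name;
    std::string_view correction;  // empty when there is no correction alias
};

constexpr std::array<NameEntry, 2> kNames{{
    {0xFE18,
     "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET",
     "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"},
    {0x0041, "LATIN CAPITAL LETTER A", ""},
}};

// Returns the character whose formal name or correction alias equals `name`
// exactly. No loose matching is attempted, and no other alias types are known.
std::optional<char32_t> lookupByNameOrCorrection(std::string_view name) {
    for (const NameEntry& entry : kNames) {
        if (name == entry.name ||
                (!entry.correction.empty() && name == entry.correction)) {
            return entry.codePoint;
        }
    }
    return std::nullopt;
}
```

In this sketch both spellings resolve to U+FE18, which mirrors the claim that the two `\N{…}` expressions become equivalent; a lowercase or otherwise loosely spelled name is still rejected.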

Checklist

Required: Issue filed: ICU-23350
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

eggrobin · 2026-03-30T15:31:11Z

@markusicu I thought this was covered by ICU-8963, but while that ticket mentions it in passing at the end, it is about something else.

Shall I file a ticket? I think I could self-approve such a ticket, since the TC has approved the proposal https://groups.google.com/a/unicode.org/g/icu-design/c/Tx2123ejkTI/m/ZDLGzY-XDQAJ which included this (under Name aliases and UAX44-LM2 in \N and \p{Name=…}).

markusicu · 2026-03-30T21:35:41Z

> @markusicu I thought this was covered by ICU-8963, but while that ticket mentions it in passing at the end, it is about something else.

Right. That one is about changing the data structure so that we can store multiple aliases, and adding API (constants) for selecting among them. Maybe new API altogether. And a little bit of UnicodeSet parser work to wire in API changes as needed.

> Shall I file a ticket?

Yes -- I can't find one for this.

> I think I could self-approve such a ticket, since the TC has approved the proposal [...] which included this

+1

markusicu · 2026-03-30T21:38:23Z

icu4c/source/common/uniset_props.cpp

+                    for (const UCharNameChoice nameChoice :
+                         std::array{U_EXTENDED_CHAR_NAME, U_CHAR_NAME_ALIAS}) {
+                        ec = U_ZERO_ERROR;
+                        UChar32 ch = u_charFromName(nameChoice, buf, &ec);


FYI: The feature is good, but it will make this even slower :-(
We should really add an API that we ask to consider all aliases, or some set of several aliases. (Different ticket & PR)
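
To make the cost concern concrete, here is a rough sketch, not ICU code, of why trying one name choice after another repeats work. `NameChoice`, `Entry`, `kTable`, `findByChoice`, `findSequentially`, and `findByAnyChoice` are all hypothetical; in particular, `findByAnyChoice` is only one guess at what an API that considers several kinds of names at once might look like, and the linear scan stands in for whatever lookup the real name data uses.

```cpp
#include <array>
#include <initializer_list>
#include <optional>
#include <string_view>

enum class NameChoice { kName, kCorrection };

// Hypothetical flattened name table: one row per (character, kind of name).
struct Entry {
    char32_t codePoint;
    NameChoice choice;
    std::string_view text;
};

constexpr std::array<Entry, 3> kTable{{
    {0xFE18, NameChoice::kName,
     "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"},
    {0xFE18, NameChoice::kCorrection,
     "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"},
    {0x0041, NameChoice::kName, "LATIN CAPITAL LETTER A"},
}};

// One full lookup restricted to a single kind of name; `scans` counts lookups.
std::optional<char32_t> findByChoice(NameChoice choice, std::string_view text,
                                     int& scans) {
    ++scans;
    for (const Entry& e : kTable) {
        if (e.choice == choice && e.text == text) return e.codePoint;
    }
    return std::nullopt;
}

// The shape of the loop in the hunk above: try each choice in turn and stop
// at the first hit. A miss costs one lookup per choice.
std::optional<char32_t> findSequentially(std::string_view text, int& scans) {
    for (const NameChoice choice :
         {NameChoice::kName, NameChoice::kCorrection}) {
        if (auto ch = findByChoice(choice, text, scans)) return ch;
    }
    return std::nullopt;
}

// Hypothetical alternative: a single lookup that accepts any kind of name.
std::optional<char32_t> findByAnyChoice(std::string_view text, int& scans) {
    ++scans;
    for (const Entry& e : kTable) {
        if (e.text == text) return e.codePoint;
    }
    return std::nullopt;
}
```

In this toy model a corrected name, or a name that does not exist at all, costs two lookups with the sequential loop and one with the combined call; the real difference in ICU would depend on how its name data is searched.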

markusicu · 2026-03-31T18:00:17Z

icu4c/source/common/uniset_props.cpp

+                    const UChar32 result = getCharacterByName(
+                    std::u16string_view(pattern_).substr(start, parsePosition_.getIndex() - 1 - start));
+                    if (result == U_SENTINEL || (hex.has_value() && result != hex) ||
+                        (literal.has_value() && result != literal)) {


markusicu

please indent more than the if-body

eggrobin

Done. We should probably figure out how to tell clang-format to do that: it indents to where the ( is, and if ( is the same length as our indent…

markusicu

I don't know why it does that; in google3 (where the indent is 2 spaces), the effect is that the continuation is indented more than the body; it should make that work with a larger indent as well

eggrobin

Well, it does what it does because that works for google3 and other major users.

It seems that AlignAfterOpenBracket: DontAlign can work, but it is a very blunt hammer: https://discourse.llvm.org/t/misleading-indentation-on-if-statements-with-clang-format-alignafteropenbracket/89069, see https://clang.llvm.org/docs/ClangFormatStyleOptions.html#alignafteropenbracket.
Maybe https://clang.llvm.org/docs/ClangFormatStyleOptions.html#breakafteropenbracketif would work?

eggrobin

With the clang-format I have on my machine,

    ContinuationIndentWidth: 8
    AlignAfterOpenBracket: DontAlign

works; I don’t have BreakAfterOpenBracketIf.
But of course it results in unnecessarily wide continuation indents elsewhere.
Anyway, another example of why I don’t think clang-formatting everything in ICU is happening soon…
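
The layout problem discussed in this thread is easy to see in a small standalone function. The sketch below is hypothetical: `kSentinel`, `rejectAligned`, and `rejectIndented` only mirror the shape of the condition in the hunk above and are not the ICU code, and what `hex` and `literal` represent in the parser is not assumed beyond their being optional values compared with the result. With a 4-space indent, `if (` is also four characters wide, so a continuation aligned after the opening parenthesis starts in the same column as the body; indenting the continuation by eight spaces keeps the two visually distinct.

```cpp
#include <cstdint>
#include <optional>

constexpr std::int32_t kSentinel = -1;  // stand-in for U_SENTINEL

// Aligned after the open bracket: the second line of the condition and the
// body both start four columns in from the `if`, so they run together.
bool rejectAligned(std::int32_t result, std::optional<std::int32_t> hex,
                   std::optional<std::int32_t> literal) {
    if (result == kSentinel || (hex.has_value() && result != hex) ||
        (literal.has_value() && result != literal)) {
        return true;
    }
    return false;
}

// Continuation indented by eight: the condition no longer lines up with the
// body. The logic is identical to rejectAligned.
bool rejectIndented(std::int32_t result, std::optional<std::int32_t> hex,
                    std::optional<std::int32_t> literal) {
    if (result == kSentinel || (hex.has_value() && result != hex) ||
            (literal.has_value() && result != literal)) {
        return true;
    }
    return false;
}
```

Both functions behave the same; only the second line of the condition moves, which is the change requested in the review comment.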

markusicu

lgtm pse squash

jira-pull-request-webhook · 2026-03-31T21:25:28Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2026-03-31T21:25:49Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin requested a review from markusicu March 30, 2026 15:31

markusicu reviewed Mar 30, 2026

eggrobin changed the title ~~Support name aliases of type correction in UnicodeSet \N~~ ICU-23350 Support name aliases of type correction in UnicodeSet \N Mar 30, 2026

markusicu reviewed Mar 31, 2026

markusicu reviewed Mar 31, 2026

markusicu approved these changes Mar 31, 2026

markusicu self-assigned this Mar 31, 2026

eggrobin force-pushed the correction branch from 2534ecd to 5df5359 Compare March 31, 2026 21:25

ICU-23350 Support name aliases of type correction in UnicodeSet \N

a535367

eggrobin force-pushed the correction branch from 5df5359 to a535367 Compare March 31, 2026 21:25

eggrobin merged commit d2a2957 into unicode-org:main Apr 1, 2026
98 checks passed
